Relational Databases in Python

Data Wrangling and Relational Databases

In the context of data wrangling, we recommend that databases and SQL only come into play for gathering data or storing data. That is:

Connecting to a database and importing data into a pandas DataFrame (or the analogous data structure in your preferred programming language), then assessing and cleaning that data, or
Connecting to a database and storing data you just gathered (which could potentially be from a database), assessed, and cleaned

These tasks are especially necessary when you have large amounts of data, which is where SQL and other databases excel over flat files.

The two scenarios above can be further broken down into three main tasks:

Connecting to a database in Python
Storing data from a pandas DataFrame in a database to which you're connected, and
Importing data from a database to which you're connected to a pandas DataFrame

This Lesson

For the example in this lesson, we're going to do these in order:

Connect to a database. We'll connect to a SQLite database using SQLAlchemy , a database toolkit for Python.
Store the data in the cleaned master dataset in that database. We'll do this using pandas' to_sql DataFrame method.
Then read the brand new data in that database back into a pandas DataFrame. We'll do this using pandas' read_sql function.

The third one isn’t necessary for this lesson, but often in the workplace, instead of having to download files, scrape web pages, hit an API, etc., you're given a database right at the beginning of a project.

All three of these tasks will be introduced and carried out in the Jupyter Notebook below. These are not quizzes. All of the code is provided for you. Your job is to read and understand each comment and line of code, then run the code.

Workspace

This section contains either a workspace (it can be a Jupyter Notebook workspace or an online code editor work space, etc.) and it cannot be automatically downloaded to be generated here. Please access the classroom with your account and manually download the workspace to your local machine. Note that for some courses, Udacity upload the workspace files onto https://github.com/udacity , so you may be able to download them there.

Workspace Information:

Default file path:
Workspace type: jupyter
Opened files (when workspace is loaded): n/a

Data Wrangling in SQL?

Data wrangling can actually be performed in SQL. We believe that pandas is better equipped for gathering (pandas has a huge simplicity advantage in this area), assessing, and cleaning data, so we usually recommend that you use pandas if given the choice. If wrangling in a work setting, sometimes your tool of choice for data wrangling depends on your company infrastructure, though.

Here is an interesting Reddit thread that debates pandas vs. SQL in general and touches on several topics related to data wrangling.